Computer-Aided Assessment


Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection

arXiv.org Artificial Intelligence

Automated Essay Scoring (AES) plays a crucial role in assessing language learners' writing quality, reducing grading workload, and providing real-time feedback. Arabic AES systems are particularly challenged by the lack of annotated essay datasets. This paper presents a novel framework leveraging Large Language Models (LLMs) and Transformers to generate synthetic Arabic essay datasets for AES. We prompt an LLM to generate essays across CEFR proficiency levels and introduce controlled error injection using a fine-tuned Standard Arabic BERT model for error type prediction. Our approach produces realistic human-like essays, contributing a dataset of 3,040 annotated essays. Additionally, we develop a BERT-based auto-marking system for accurate and scalable Arabic essay evaluation. Experimental results demonstrate the effectiveness of our framework in improving Arabic AES performance.
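
As a rough illustration of the pipeline this abstract describes, the sketch below generates an essay at a target CEFR level with an LLM and then injects errors guided by an error-type classifier. The checkpoints, the Arabic prompt wording, and the corruption rules are placeholders for illustration, not the authors' models or error taxonomy.

```python
import random

from transformers import pipeline

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def generate_essay(llm, topic: str, level: str) -> str:
    """Prompt an instruction-style LLM for an Arabic essay at a target CEFR level."""
    # "Write an essay about <topic> at CEFR level <level>."
    prompt = f"اكتب مقالاً عن '{topic}' بمستوى {level} حسب الإطار الأوروبي المرجعي."
    return llm(prompt, max_new_tokens=400, return_full_text=False)[0]["generated_text"]


def apply_error(word: str, error_type: str) -> str:
    # Toy corruption rules; the paper's injection is guided by a learned model.
    if error_type == "SPELLING" and len(word) > 2:
        return word[1] + word[0] + word[2:]  # transpose the first two letters
    return word + "ً"  # spurious diacritic as a fallback corruption


def inject_errors(essay: str, error_clf, rate: float = 0.1) -> str:
    """Corrupt a fraction of words using the error type the classifier predicts."""
    words = essay.split()
    for i, word in enumerate(words):
        if random.random() < rate:
            error_type = error_clf(word)[0]["label"]  # e.g. "SPELLING", "AGREEMENT"
            words[i] = apply_error(word, error_type)
    return " ".join(words)


if __name__ == "__main__":
    llm = pipeline("text-generation", model="aubmindlab/aragpt2-base")             # placeholder generator
    error_clf = pipeline("text-classification", model="arabic-bert-error-types")  # hypothetical checkpoint
    essay = generate_essay(llm, "أهمية القراءة", "B1")
    print(inject_errors(essay, error_clf))
```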


Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring

arXiv.org Artificial Intelligence

The rapid development of machine learning (ML) technologies, particularly large language models (LLMs), has led to major advancements in natural language processing (NLP, Abbasi et al. 2023). While much of this advancement happened under the umbrella of the common task framework which espouses transparency and openness (Abbasi et al. 2023), in recent years, closed LLMs such as GPT-3 and GPT-4 have set new performance standards in tasks ranging from text generation to question answering, demonstrating unprecedented capabilities in zero-shot and few-shot learning scenarios (Brown et al. 2020, OpenAI 2023). Given the strong performance of closed LLMs such as GPT-4, many studies within the LLM-as-a-judge paradigm rely on their scores as ground truth benchmarks for evaluating both open and closed LLMs (Chiang and Lee 2023), further entrenching the dominance of SOTA closed LLMs (Vergho et al. 2024). Along with closed LLMs, there are also LLMs where the pre-trained models (i.e., training weights) and inference code are publicly available ("open LLMs") such as Llama (Touvron et al. 2023, Dubey et al. 2024) as well as LLMs where the full training data and training code are also available ("open-source LLMs") such as OLMo (Groeneveld et al. 2024). Open and open-source LLMs provide varying levels of transparency for developers and researchers (Liu et al. 2023). Access to model weights, training data, and inference code enables several benefits for the user-developer-researcher community, including lower costs per input/output token through third-party API services, support for local/offline pre-training and fine-tuning, and deeper analysis of model biases and debiasing strategies. However, the dominance of closed LLMs raises a number of concerns, including accessibility and fairness (Strubell et al. 2020, Bender 2021, Irugalbandara et al. 2024).
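
To make the accessibility trade-off concrete, here is a minimal sketch that sends the same zero-shot scoring prompt to a closed LLM behind a metered API and to an open-weight LLM hosted locally. The model names, rubric, and score range are assumptions for illustration.

```python
from openai import OpenAI
from transformers import pipeline

PROMPT = ("Score the following essay from 1 to 6 for overall writing quality. "
          "Respond with a single integer.\n\nEssay:\n{essay}")


def score_with_closed_llm(essay: str) -> str:
    """Closed LLM behind an API: requires OPENAI_API_KEY and is billed per token."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder closed model
        messages=[{"role": "user", "content": PROMPT.format(essay=essay)}],
    )
    return response.choices[0].message.content


def score_with_open_llm(essay: str) -> str:
    """Open-weight LLM: runs locally/offline once the weights are downloaded."""
    generator = pipeline("text-generation",
                         model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder open model
    output = generator(PROMPT.format(essay=essay),
                       max_new_tokens=5, return_full_text=False)
    return output[0]["generated_text"]


if __name__ == "__main__":
    essay = "My favourite season is autumn because the weather is calm ..."
    print("closed:", score_with_closed_llm(essay))
    print("open:", score_with_open_llm(essay))
```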


Privacy-Preserved Automated Scoring using Federated Learning for Educational Research

arXiv.org Artificial Intelligence

Data privacy remains a critical concern in educational research, necessitating Institutional Review Board (IRB) certification and stringent data handling protocols to ensure compliance with ethical standards. Traditional approaches rely on anonymization and controlled data-sharing mechanisms to facilitate research while mitigating privacy risks. However, these methods still involve direct access to raw student data, which introduces potential vulnerabilities and is time-consuming. This study proposes a federated learning (FL) framework for automatic scoring in educational assessments, eliminating the need to share raw data. Our approach leverages client-side model training, where student responses are processed locally on edge devices, and only optimized model parameters are shared with a central aggregation server. To effectively aggregate heterogeneous model updates, we introduce an adaptive weighted averaging strategy, which dynamically adjusts weight contributions based on client-specific learning characteristics. This method ensures robust model convergence while preserving privacy. We evaluate our framework using assessment data from nine middle schools, comparing the accuracy of federated learning-based scoring models with traditionally trained centralized models. A paired t-test ($t(8) = 2.29$, $p = 0.051$) indicates that the accuracy difference between the two approaches is not statistically significant at the 5% level, suggesting that federated learning achieves comparable performance while safeguarding student data. Furthermore, our method significantly reduces data collection, processing, and deployment overhead, accelerating the adoption of AI-driven educational assessments in a privacy-compliant manner.
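
The sketch below shows what the server-side aggregation step might look like. The paper's adaptive weighting is based on client-specific learning characteristics; the proxy used here (sample count scaled by local validation accuracy) is an assumption, not the authors' formula.

```python
from typing import Dict, List

import torch


def adaptive_weighted_average(
    client_states: List[Dict[str, torch.Tensor]],
    n_samples: List[int],
    val_accuracy: List[float],
) -> Dict[str, torch.Tensor]:
    """Merge client model parameters using normalized adaptive weights."""
    raw = [n * acc for n, acc in zip(n_samples, val_accuracy)]
    weights = [w / sum(raw) for w in raw]
    merged = {}
    for name in client_states[0]:
        merged[name] = sum(w * state[name].float()
                           for w, state in zip(weights, client_states))
    return merged


if __name__ == "__main__":
    # Three toy "schools", each contributing the weights of the same small scorer.
    clients = [torch.nn.Linear(4, 1).state_dict() for _ in range(3)]
    merged = adaptive_weighted_average(clients,
                                       n_samples=[120, 80, 200],
                                       val_accuracy=[0.81, 0.76, 0.88])
    print({name: tensor.shape for name, tensor in merged.items()})
```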


Teach-to-Reason with Scoring: Self-Explainable Rationale-Driven Multi-Trait Essay Scoring

arXiv.org Artificial Intelligence

Multi-trait automated essay scoring (AES) systems provide a fine-grained evaluation of an essay's diverse aspects. While they excel in scoring, prior systems fail to explain why specific trait scores are assigned. This lack of transparency leaves instructors and learners unconvinced of the AES outputs, hindering their practical use. To address this, we propose a self-explainable Rationale-Driven Multi-trait automated Essay scoring (RaDME) framework. RaDME leverages the reasoning capabilities of large language models (LLMs) by distilling them into a smaller yet effective scorer. This more manageable student model is optimized to sequentially generate a trait score followed by the corresponding rationale, thereby inherently learning to select a more justifiable score by considering the subsequent rationale during training. Our findings indicate that while LLMs underperform in direct AES tasks, they excel in rationale generation when provided with precise numerical scores. Thus, RaDME integrates the superior reasoning capacities of LLMs into the robust scoring accuracy of an optimized smaller model. Extensive experiments demonstrate that RaDME achieves both accurate and adequate reasoning while supporting high-quality multi-trait scoring, significantly enhancing the transparency of AES.
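
A minimal sketch of how such score-then-rationale distillation targets could be assembled is shown below: a teacher LLM explains an already-known gold score, and the student scorer is trained to emit the score first and the rationale second. The prompt templates and data fields are assumptions, not RaDME's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TraitExample:
    essay: str
    trait: str       # e.g. "coherence", "grammar"
    gold_score: int  # human-assigned trait score


def rationale_prompt(ex: TraitExample) -> str:
    """Teacher prompt: explain a score that is already known."""
    return (f"The essay below received a {ex.trait} score of {ex.gold_score}. "
            f"Explain in two or three sentences why this score is appropriate.\n\n{ex.essay}")


def student_target(ex: TraitExample, rationale: str) -> str:
    """Score-then-rationale sequence the smaller student model is trained to generate."""
    return f"{ex.trait} score: {ex.gold_score}\nRationale: {rationale}"


def build_training_pairs(examples: List[TraitExample],
                         teacher: Callable[[str], str]) -> List[dict]:
    pairs = []
    for ex in examples:
        rationale = teacher(rationale_prompt(ex))  # teacher = any LLM callable
        pairs.append({
            "input": f"Score the {ex.trait} of this essay:\n{ex.essay}",
            "target": student_target(ex, rationale),
        })
    return pairs


if __name__ == "__main__":
    fake_teacher = lambda prompt: ("The argument is easy to follow overall, "
                                   "but several transitions between paragraphs are abrupt.")
    pairs = build_training_pairs(
        [TraitExample("Essays should be concise ...", "coherence", 3)], fake_teacher)
    print(pairs[0]["target"])
```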


PyEvalAI: AI-assisted evaluation of Jupyter Notebooks for immediate personalized feedback

arXiv.org Artificial Intelligence

Grading student assignments in STEM courses is a laborious and repetitive task for tutors, often requiring a week to assess an entire class. For students, this delay of feedback prevents iterating on incorrect solutions, hampers learning, and increases stress when exercise scores determine admission to the final exam. Recent advances in AI-assisted education, such as automated grading and tutoring systems, aim to address these challenges by providing immediate feedback and reducing grading workload. However, existing solutions often fall short due to privacy concerns, reliance on proprietary closed-source models, lack of support for combining Markdown, LaTeX and Python code, or excluding course tutors from the grading process. To overcome these limitations, we introduce PyEvalAI, an AI-assisted evaluation system, which automatically scores Jupyter notebooks using a combination of unit tests and a locally hosted language model to preserve privacy. Our approach is free, open-source, and ensures tutors maintain full control over the grading process. A case study demonstrates its effectiveness in improving feedback speed and grading efficiency for exercises in a university-level course on numerics.
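
The sketch below illustrates the general two-stage idea, not PyEvalAI's actual code: deterministic unit tests produce a score, and a locally hosted language model drafts feedback for the tutor to review. Function names and the model checkpoint are assumptions.

```python
from typing import Callable, Dict, List

from transformers import pipeline


def run_unit_tests(student_code: str, tests: List[Callable[[Dict], None]]) -> float:
    """Execute student code in an isolated namespace and return the pass rate."""
    namespace: Dict = {}
    try:
        exec(student_code, namespace)  # a real system would sandbox this step
    except Exception:
        return 0.0
    passed = sum(1 for test in tests if _passes(test, namespace))
    return passed / len(tests)


def _passes(test: Callable[[Dict], None], namespace: Dict) -> bool:
    try:
        test(namespace)
        return True
    except Exception:
        return False


def draft_feedback(student_code: str, score: float, local_lm) -> str:
    """Ask a locally hosted model for feedback; a tutor reviews it before release."""
    prompt = (f"A student submitted this solution (unit-test score {score:.0%}):\n"
              f"{student_code}\nGive two sentences of constructive feedback.")
    return local_lm(prompt, max_new_tokens=80, return_full_text=False)[0]["generated_text"]


if __name__ == "__main__":
    code = "def mean(xs):\n    return sum(xs) / len(xs)\n"

    def test_mean(ns):
        assert ns["mean"]([2, 4, 6]) == 4

    score = run_unit_tests(code, [test_mean])
    local_lm = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")  # placeholder local model
    print(score, draft_feedback(code, score, local_lm))
```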


EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

arXiv.org Artificial Intelligence

Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research. Our dataset and code will be available upon acceptance.
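
For context, trait-level agreement of this kind is typically reported as quadratic weighted kappa (QWK) between model and human scores. The sketch below computes QWK per trait level on toy data; the trait-to-level grouping is an illustrative assumption, not EssayJudge's schema.

```python
from collections import defaultdict
from typing import Dict, List

from sklearn.metrics import cohen_kappa_score

# Illustrative trait-to-level grouping (not EssayJudge's schema).
TRAIT_LEVEL = {"vocabulary": "lexical", "grammar": "sentence",
               "coherence": "discourse", "argumentation": "discourse"}


def per_level_qwk(records: List[Dict]) -> Dict[str, float]:
    """records: dicts with 'trait', 'human', and 'model' integer scores."""
    by_level = defaultdict(lambda: ([], []))
    for r in records:
        human, model = by_level[TRAIT_LEVEL[r["trait"]]]
        human.append(r["human"])
        model.append(r["model"])
    return {level: cohen_kappa_score(h, m, weights="quadratic")
            for level, (h, m) in by_level.items()}


if __name__ == "__main__":
    toy = [{"trait": "coherence", "human": 3, "model": 2},
           {"trait": "coherence", "human": 4, "model": 4},
           {"trait": "grammar", "human": 5, "model": 5},
           {"trait": "grammar", "human": 2, "model": 3}]
    print(per_level_qwk(toy))
```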


Improve LLM-based Automatic Essay Scoring with Linguistic Features

arXiv.org Artificial Intelligence

Automatic Essay Scoring (AES) assigns scores to student essays, reducing the grading workload for instructors. Developing a scoring system capable of handling essays across diverse prompts is challenging due to the flexibility and diverse nature of the writing task. Existing methods typically fall into two categories: supervised feature-based approaches and large language model (LLM)-based methods. Supervised feature-based approaches often achieve higher performance but require resource-intensive training. In contrast, LLM-based methods are computationally efficient during inference but tend to suffer from lower performance. This paper combines these approaches by incorporating linguistic features into LLM-based scoring. Experimental results show that this hybrid method outperforms baseline models for both in-domain and out-of-domain writing prompts.
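
A minimal sketch of the hybrid idea follows: compute a few inexpensive linguistic features and expose them to the LLM inside the scoring prompt. The specific features and the prompt wording are assumptions; the paper's feature set may differ.

```python
import re


def linguistic_features(essay: str) -> dict:
    """A few cheap, prompt-agnostic features; the paper's feature set may differ."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }


def build_prompt(essay: str) -> str:
    feats = linguistic_features(essay)
    feature_text = ", ".join(f"{name}={value:.2f}" for name, value in feats.items())
    return (f"Linguistic features: {feature_text}\n"
            "Using both the features and the essay itself, assign a holistic score "
            "from 1 to 6. Respond with the integer only.\n\n"
            f"Essay:\n{essay}")


if __name__ == "__main__":
    print(build_prompt("Reading is important. It builds vocabulary and focus."))
```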


Humanity's Last Exam

arXiv.org Artificial Intelligence

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.


Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring

arXiv.org Artificial Intelligence

In automated essay scoring (AES), recent efforts have shifted toward cross-prompt settings that score essays on unseen prompts for practical applicability. However, prior methods trained with essay-score pairs from specific prompts struggle to obtain prompt-generalized essay representations. In this work, we propose grammar-aware cross-prompt trait scoring (GAPS), which internally captures prompt-independent syntactic aspects to learn generic essay representations. We acquire grammatical error-corrected information in essays via grammatical error correction (GEC) and design the AES model to seamlessly integrate such information. By internally referring to both the corrected and the original essays, the model can focus on generic features during training. Empirical experiments validate our method's generalizability, showing remarkable improvements in prompt-independent and grammar-related traits. Furthermore, GAPS achieves notable QWK gains in the most challenging cross-prompt scenario, highlighting its strength in evaluating unseen prompts.
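
The sketch below illustrates one way the corrected and original essays could be combined, assuming a GEC model and a shared encoder whose [CLS] representations are concatenated for the scoring head. The checkpoints and the fusion strategy are placeholders, not necessarily GAPS's architecture.

```python
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

# Placeholder checkpoints, not the authors' models.
gec = pipeline("text2text-generation", model="vennify/t5-base-grammar-correction")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")


def encode(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation


def grammar_aware_representation(essay: str) -> torch.Tensor:
    """Concatenate encodings of the original and GEC-corrected essay."""
    corrected = gec(essay, max_new_tokens=256)[0]["generated_text"]
    return torch.cat([encode(essay), encode(corrected)], dim=-1)  # 2 * hidden_size


if __name__ == "__main__":
    representation = grammar_aware_representation("She go to school every days.")
    print(representation.shape)  # torch.Size([1, 1536]) for a BERT-base encoder
```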


How well can LLMs Grade Essays in Arabic?

arXiv.org Artificial Intelligence

This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, but was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges faced by LLMs in processing Arabic, including tokenization complexities and higher computational demands. Performance variation across different courses underscores the need for adaptive models capable of handling diverse assessment formats and highlights the positive impact of effective prompt engineering on improving LLM outputs. To the best of our knowledge, this study is the first to empirically evaluate the performance of multiple generative LLMs on Arabic essays using authentic student data.
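
The mixed-language prompting strategy can be sketched as below: English instructions and marking guidelines wrapped around Arabic question and essay content. The rubric text and score range are illustrative assumptions.

```python
def mixed_language_prompt(question_ar: str, essay_ar: str, max_score: int) -> str:
    """English instructions and guidelines around Arabic question and essay content."""
    return (
        "You are grading an Arabic student essay. Follow the marking guidelines "
        f"and return only an integer score between 0 and {max_score}.\n"
        "Guidelines: reward relevance to the question, coherent structure, and "
        "correct Modern Standard Arabic grammar; penalize off-topic content.\n\n"
        f"السؤال: {question_ar}\n"
        f"إجابة الطالب: {essay_ar}\n"
        "Score:"
    )


if __name__ == "__main__":
    print(mixed_language_prompt("ما أهمية القراءة؟",
                                "القراءة مهمة لأنها توسع المدارك وتثري اللغة.",
                                max_score=10))
```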